Abstract. A cover of a string x = x[1..n] is a proper substring u of x
such that x can be constructed from possibly overlapping instances of u.
A recent paper [12] relaxes this definition: an enhanced cover u of x
is a border of x (that is, a proper prefix that is also a suffix) that covers a
maximum number of positions in x (not necessarily all). That paper proposes
efficient algorithms for the computation of enhanced covers. These algorithms
depend on the prior computation of the border array β[1..n],
where β[i] is the length of the longest border of x[1..i], 1 ≤ i ≤ n. In this
paper, we first show how to compute enhanced covers using instead the
prefix table: an array π[1..n] such that π[i] is the length of the longest
substring of x beginning at position i that matches a prefix of x. Unlike
the border array, the prefix table is robust: its properties hold also for
indeterminate strings, that is, strings defined on subsets of the alphabet
Σ rather than individual elements of Σ. Thus, our algorithms, in
addition to being faster in practice and more space-efficient than those
of [12], allow us to easily extend the computation of enhanced covers to
indeterminate strings. Both for regular and indeterminate strings, our
algorithms execute in expected linear time. Along the way we establish
an important theoretical result: that the expected maximum length of
any border of any prefix of a regular string x is approximately 1.64 for
binary alphabets, less for larger ones.
1 Introduction

The concept of periodicity is fundamental to combinatorics on words and
related algorithms: it is difficult to imagine a research contribution that
does not somehow involve periods of strings. But periodicity alone may
not be the best descriptor of a string; for example, x = abaababab, a string
of length n = 9, has period 7 and corresponding generator abaabab,
but it could well be more interesting that every position but one in x lies
within an occurrence of ab. In 1990 Apostolico & Ehrenfeucht [3] introduced
the idea of quasiperiodicity: a quasiperiod or cover of a string
x is a proper substring u of x such that any position in x is contained
in an occurrence of u; u is then said to cover x, which is said to be
quasiperiodic. Thus, for example, u = aba is a cover of x = ababaaba.
Several linear-time algorithms were proposed for the computation of covers,
culminating in an algorithm [16] to compute the cover
array γ, where γ[i] gives the length j of the longest cover of x[1..i].
Since the longest cover of x[1..j] is also a cover of x[1..i], γ implicitly
specifies all the covers of every prefix of x. A recent paper [2] extends the
computation of γ to “indeterminate strings” (see below for definition).
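The two examples above can be checked directly from the definition. The following is a minimal brute-force sketch in Python (quadratic time, and not one of the linear-time algorithms cited); the function names `occurrences`, `covered_positions` and `is_cover` are introduced here purely for illustration.

```python
def occurrences(u, x):
    """0-based start positions of all (possibly overlapping) occurrences of u in x."""
    return [i for i in range(len(x) - len(u) + 1) if x.startswith(u, i)]


def covered_positions(u, x):
    """Set of 0-based positions of x lying inside some occurrence of u."""
    return {p for i in occurrences(u, x) for p in range(i, i + len(u))}


def is_cover(u, x):
    """True if u is a nonempty proper substring of x covering every position of x."""
    return 0 < len(u) < len(x) and len(covered_positions(u, x)) == len(x)


print(is_cover("aba", "ababaaba"))                # True: aba covers ababaaba
print(len(covered_positions("ab", "abaababab")))  # 8 of the 9 positions are covered
```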

Even though the cover of a string can provide useful information,
quasiperiodic strings are infrequent among strings in general.
Another approach to string covering was therefore proposed:
a set U_k = U_k(x) of strings, each of length k, is said to be a minimum
k-cover of x if every position in x lies within some occurrence of
an element of U_k, and no smaller set of k-strings has this property. Thus
U_2(abaababab) = U_2(ababaaba) = {ab, ba}. In [10] the computation of U_k
was shown to be NP-complete, though an approximate polynomial-time
algorithm was presented in [14].

Recall that a border of x is a possibly empty proper prefix of x
that is also a suffix: every nonempty string has a border of length zero.
Recently the promising idea of an enhanced cover was introduced [12];
that is, a border u of x = x[1..n] that covers a maximum number m ≤ n
of positions in x. Then the minimum enhanced cover mec(x) is the
shortest border u that covers m positions, and [12] presented an algorithm
to compute mec(x) in O(n) time. Thus for x = abaababab, mec(x) = ab.
Further, on the analogy of the cover array defined above, the authors
proposed the minimum enhanced cover array MEC_x (for every i ∈
1..n, MEC_x[i] = |mec(x[1..i])|, the length of the minimum enhanced cover
of x[1..i]) and showed how to compute it in O(n log n) time. In this


Notation and terminology generally follow m- 
paper we introduce in addition the CMEC array, where CMEC[i] specifies the
number of positions in x[1..i] covered by the border of length MEC[i]. Thus, for
example, MEC_abaababab = 001123232 and CMEC_abaababab = 002346688.
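The example values can be reproduced straight from the definitions. The sketch below is a brute-force Python reference (cubic time in the worst case), useful only for checking small cases; it is not the algorithm of [12] nor the one developed in this paper, and the helper names are illustrative.

```python
def border_lengths(x):
    """Lengths of all nonempty proper borders of x, in increasing order."""
    return [k for k in range(1, len(x)) if x[:k] == x[-k:]]


def count_covered(u, x):
    """Number of positions of x lying inside some occurrence of u."""
    covered = set()
    for i in range(len(x) - len(u) + 1):
        if x.startswith(u, i):
            covered.update(range(i, i + len(u)))
    return len(covered)


def mec_arrays_bruteforce(x):
    """Return (MEC, CMEC) as 0-based lists; entry i-1 describes the prefix x[1..i]."""
    mec, cmec = [], []
    for i in range(1, len(x) + 1):
        prefix = x[:i]
        best_len, best_cov = 0, 0
        # Borders are tried shortest first, so a strict '>' keeps the
        # shortest border among those covering the maximum number of positions.
        for k in border_lengths(prefix):
            cov = count_covered(prefix[:k], prefix)
            if cov > best_cov:
                best_len, best_cov = k, cov
        mec.append(best_len)
        cmec.append(best_cov)
    return mec, cmec


mec, cmec = mec_arrays_bruteforce("abaababab")
print("".join(map(str, mec)))   # 001123232
print("".join(map(str, cmec)))  # 002346688
```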

In order to compute MEC_x, the authors of [12] made use of a variant of
the border array, that is, an integer array β[1..n] in which for every
i ∈ 1..n, β[i] is the length of the longest border of x[1..i]. In this paper
we adopt a different approach to the computation of MEC_x, using, instead
of a border array, the prefix table π = π[1..n], where for every i ∈ 1..n,
π[i] is the length of the longest substring at position i of x that equals a
prefix of x. It has long been folklore that β and π are “equivalent”, but it
has only recently been made explicit [6] that each can be computed from
the other in linear time. However, this equivalence holds only for regular
strings x in which each entry x[i] is constrained to be a single element of
the underlying alphabet Σ.
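For concreteness, both arrays can be computed for a regular string in linear time by well-known methods. The sketch below uses the classical failure-function recurrence for β and the standard Z-algorithm for π; it is offered as an illustration under the assumption of 0-based Python lists (entry i-1 holds the value for position i), not as the construction of [6].

```python
def border_array(x):
    """beta[i-1] = length of the longest border of x[1..i]."""
    beta = [0] * len(x)
    k = 0
    for i in range(1, len(x)):
        while k > 0 and x[i] != x[k]:
            k = beta[k - 1]          # fall back to the next shorter border
        if x[i] == x[k]:
            k += 1
        beta[i] = k
    return beta


def prefix_table(x):
    """pi[i-1] = length of the longest substring at position i matching a prefix of x."""
    n = len(x)
    pi = [0] * n
    if n:
        pi[0] = n
    left = right = 0                 # [left, right) is the rightmost prefix match seen
    for i in range(1, n):
        if i < right:
            pi[i] = min(right - i, pi[i - left])
        while i + pi[i] < n and x[pi[i]] == x[i + pi[i]]:
            pi[i] += 1
        if i + pi[i] > right:
            left, right = i, i + pi[i]
    return pi


print(prefix_table("ababaababa"))  # [10, 0, 3, 0, 1, 5, 0, 3, 0, 1]
print(border_array("ababaababa"))  # [0, 0, 1, 2, 3, 1, 2, 3, 4, 5]
```

One direction of the equivalence is easy to see from these outputs: x[1..j] has a border of length j-i+1 exactly when the prefix match starting at position i reaches position j.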

We say that a letter λ is indeterminate if it is any nonempty subset
of Σ, and thus a string x is said to be indeterminate if some constituent
letter x[i] is indeterminate. The idea of an indeterminate string was first
introduced with letters constrained to be either regular (single
elements of Σ) or Σ itself, and the properties of these strings have
been much studied by Blanchet-Sadri [7] and her collaborators as “partial
words” or “strings with holes”. Indeterminate strings can model DNA
sequences on Σ = {A, C, G, T} when ambiguities arise in determining
individual nucleotides (letters).

Two indeterminate letters λ and μ are said to match (written λ ≈ μ)
whenever λ ∩ μ ≠ ∅, a relation that is in general nontransitive:
a ≈ {a, b} and {a, b} ≈ b, but a ≉ b. An important consequence of this
nontransitivity is that the border array no longer correctly describes all of
the borders of x: it is no longer necessarily true, as for regular strings, that
if u is the longest border of v, in turn the longest border of x, then u is a
border of x. On the other hand, the prefix array retains all its properties
for indeterminate strings x and, in particular, correctly identifies all the
borders of every prefix of x. Algorithms are known that compute the
prefix table of an indeterminate string; conversely, [9] proves that there
exists an indeterminate string corresponding to every feasible prefix table,
while [1] describes an algorithm to compute the lexicographically least
indeterminate string determined by any given feasible prefix table. There
is thus a many-many relationship between the set of all indeterminate
strings and the set of all prefix tables. Consequently, computing MEC_x (or
simply MEC when there is no ambiguity) from the prefix table π = π_x
rather than from a variant of the border array allows us to extend the
computation to indeterminate strings.
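The failure of transitivity can be made concrete with a three-letter example. The sketch below represents each letter as a Python `frozenset` and computes the prefix table by direct comparison in quadratic time; this naive routine is an assumption made for illustration and is not one of the published prefix-table algorithms for indeterminate strings.

```python
def matches(lam, mu):
    """Two indeterminate letters match when their intersection is nonempty."""
    return bool(lam & mu)


def prefix_table_indeterminate(x):
    """pi[i] (0-based) = length of the longest substring at i matching a prefix of x."""
    n = len(x)
    pi = []
    for i in range(n):
        k = 0
        while i + k < n and matches(x[k], x[i + k]):
            k += 1
        pi.append(k)
    return pi


a, b, ab = frozenset("a"), frozenset("b"), frozenset("ab")
x = [a, ab, b]
print(matches(a, ab), matches(ab, b), matches(a, b))  # True True False
print(prefix_table_indeterminate(x))                  # [3, 2, 0]
```

Here the prefix table shows that x has a border of length 2 (the entry 2 at the second position reaches the end of x) but none of length 1 (the last entry is 0). The border v = a{a,b} itself has a border of length 1, yet that border is not a border of x, which is exactly the situation a border array cannot represent.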

In Section 2 we outline the basic methodology and data structures
used to compute the minimum enhanced cover array from the prefix table,
while illustrating the ideas with an example. Then Section 3 provides
a proof of the algorithm’s correctness, as well as an analysis of its
complexity, both worst and average case. In Section 4 we discuss the practical
application of our algorithms, in terms of time and space requirements,
and compare our prefix-based implementation with the border-based
implementation of [12]. Section 5 extends the enhanced cover array
algorithm to indeterminate strings (for rooted covers) and outlines various
other extensions, particularly to generalizations of MECs.

2 Methodology 

In this section we describe the computation of MEC_x, the enhanced cover
array of x, based on the prefix array π. Since every minimum enhanced
cover of x is also a border of x, we are initially interested in the covers
of prefixes of x. For this purpose we need arrays whose size is B, the
maximum length of any border of any prefix of x. Noting that B must be
the maximum entry in the prefix array π, we write B = max_{2≤i≤n} π[i].

Definition 1 In the maximum no cover array MNC = MNC[1..B], for
every q ∈ 1..B, MNC[q] = q', where q' is the maximum integer in 1..q such
that the prefix x[1..q'] has no cover, that is, such that γ[q'] = 0.

As shown in Figure 1, once B is computed in O(n) time from the prefix
array π, MNC can be easily computed in O(B) time using the cover array
γ[1..B] of x[1..B]. Note that the entries in MNC are monotone nondecreasing
with 1 ≤ MNC[q] ≤ q for every q ∈ 1..B. The following is fundamental to
the execution of our main algorithm:

Observation 2 If a prefix v = x[1..q] of x has a cover u, then v ≠
mec(x) (since |u| < q and u covers every position covered by v).

Thus MNC[q] specifies an upper bound q' ∈ 1..q on the length of a
minimum enhanced cover of x. Two other arrays are required for the
computation, both of length B:

Definition 3 For every q ∈ 1..B:

• PR[q] is the rightmost position in x at which the prefix x[1..q] occurs;
procedure Compute_MNC(n, π; B, γ, MNC)
    B ← π[2]
    for i ← 3 to n do
        B ← max(B, π[i])
    ▷ Compute γ[1..B] of x[1..B] using
    ▷ the algorithm Compute_PCR.
    Compute_PCR(B, π; γ)
    ▷ Note that MNC can overwrite γ.
    for q ← 1 to B do
        if γ[q] = 0 then MNC[q] ← q
        else MNC[q] ← MNC[q−1]

Fig. 1. Computing MNC from the prefix array π[1..n] and the cover array γ[1..B].


• CPR[q] is the number of positions in x covered by occurrences of x[1..q].

Here is an example of the arrays introduced thus far: 

    i    =   1  2  3  4  5  6  7  8  9 10
    x    =   a  b  a  b  a  a  b  a  b  a
    π    =  10  0  3  0  1  5  0  3  0  1
    γ    =   0  0  0  2  3
    MNC  =   1  2  3  3  3
    PR   =  10  8  8  6  6
    CPR  =   6  8 10  8 10
    MEC  =   0  0  1  2  3  1  2  3  2  3
    CMEC =   0  0  2  4  5  4  6  8  8 10

Note that for x[1..9] and x[1..10], there are actually two borders that cover
a maximum number of positions; in each case the border of minimum
length is identified in MEC.
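The rows B, γ, MNC, PR and CPR of the example can be recomputed directly from Definitions 1 and 3. The Python sketch below does so by brute force; it takes the string itself as input (rather than π) and is meant only as a reference for small examples, with all function names being illustrative.

```python
def arrays_by_definition(x):
    """Return (B, gamma, MNC, PR, CPR); each list holds the entries for q = 1..B."""
    n = len(x)

    def starts(q):
        # 1-based start positions of all occurrences of the prefix x[1..q] in x
        return [i + 1 for i in range(n - q + 1) if x.startswith(x[:q], i)]

    def covered(q, length):
        # number of positions of x[1..length] covered by occurrences of x[1..q]
        pos = set()
        for i in starts(q):
            if i + q - 1 <= length:
                pos.update(range(i, i + q))
        return len(pos)

    # B = max over i >= 2 of pi[i]: the longest prefix that occurs again later in x
    B = max((q for q in range(1, n) if len(starts(q)) > 1), default=0)
    gamma, mnc = [0] * (B + 1), [0] * (B + 1)
    for q in range(1, B + 1):
        # longest proper prefix of x[1..q] whose occurrences cover all of x[1..q]
        gamma[q] = max((c for c in range(1, q) if covered(c, q) == q), default=0)
        mnc[q] = q if gamma[q] == 0 else mnc[q - 1]
    pr = [starts(q)[-1] for q in range(1, B + 1)]
    cpr = [covered(q, n) for q in range(1, B + 1)]
    return B, gamma[1:], mnc[1:], pr, cpr


print(arrays_by_definition("ababaababa"))
# (5, [0, 0, 0, 2, 3], [1, 2, 3, 3, 3], [10, 8, 8, 6, 6], [6, 8, 10, 8, 10])
```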

The algorithm Compute_MEC is shown in Figure 2. In the first stage,
B and MNC are computed and the arrays CMEC, PR and CPR are initialized.
Then every position i > 1 such that q = π[i] > 0 is considered. Using
MNC, the longest prefix Q' = x[1..q'] of x[1..q] that does not have a cover
is identified; for prefixes of x[1..q] that do have a cover, the appropriate
PR and CPR values have already been updated. There are two main steps
in the processing of Q':

• Since i has now become the rightmost occurrence of Q' in x[1..i], we
must set PR[q'] ← i and increment the corresponding number CPR[q']
of positions covered.
procedure Compute_MEC(π; MEC, CMEC)
    n ← |π|
    Compute_MNC(n, π; B, γ, MNC)
    MEC ← 0^n; CMEC ← 0^n; PR ← 1^B
    for q ← 1 to B do CPR[q] ← q
    for i ← 2 to n do
        q ← π[i]
        ▷ x[i..i+q−1] = x[1..q].
        while q > 0 do
            ▷ x[1..q'] is the longest prefix of x[1..q] without a cover.
            q' ← MNC[q]
            ▷ x[1..q'] also occurs at i: update CPR[q'] & PR[q'].
            if i − PR[q'] < q' then
                CPR[q'] ← CPR[q'] + i − PR[q']
            else
                CPR[q'] ← CPR[q'] + q'
            PR[q'] ← i
            ▷ Update CMEC & MEC accordingly for interval i..i+q'−1.
            ▷ Ties go to the border processed later, which is the shorter one.
            if CPR[q'] ≥ CMEC[i+q'−1] then
                MEC[i+q'−1] ← q'
            if CPR[q'] > CMEC[i+q'−1] then
                CMEC[i+q'−1] ← CPR[q']
            q ← q'−1

Fig. 2. Computing MEC and CMEC from the prefix array π.


• If the number CPR[q'] of positions covered by occurrences of Q' exceeds
CMEC[i+q'−1], then CMEC and MEC must be updated accordingly.

These steps are repeated for the longest proper prefix of Q'
that does not have a cover, until no such prefix remains.
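The procedure of Figure 2 can be transcribed almost line for line. The Python sketch below does this under three assumptions that should be kept in mind: the call to Compute_PCR is replaced by a simple brute-force stand-in (`cover_array_from_pi`, which is not linear-time), arrays are 1-based with an unused entry at index 0, and MEC is updated on ties (≥) so that the shorter of two equally good borders is retained.

```python
def prefix_table_1based(x):
    """Prefix table via the Z-algorithm, returned with an unused entry at index 0."""
    n = len(x)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and x[z[i]] == x[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return [0] + z


def cover_array_from_pi(pi, B):
    """Brute-force stand-in for Compute_PCR: gamma[q] = longest cover of x[1..q]."""
    gamma = [0] * (B + 1)
    for q in range(2, B + 1):
        for c in range(q - 1, 0, -1):
            reach = 0                          # positions 1..reach are covered
            for i in range(1, q - c + 2):      # occurrences must end within x[1..q]
                if pi[i] >= c and i <= reach + 1:
                    reach = i + c - 1
            if reach == q:
                gamma[q] = c
                break
    return gamma


def compute_mec(pi):
    """Return (MEC, CMEC) for positions 1..n, given the 1-based prefix table pi."""
    n = len(pi) - 1
    B = max(pi[2:], default=0)
    gamma = cover_array_from_pi(pi, B)
    mnc = [0] * (B + 1)
    for q in range(1, B + 1):
        mnc[q] = q if gamma[q] == 0 else mnc[q - 1]
    mec, cmec = [0] * (n + 1), [0] * (n + 1)
    pr = [1] * (B + 1)
    cpr = list(range(B + 1))                   # CPR[q] = q initially
    for i in range(2, n + 1):
        q = pi[i]
        while q > 0:
            qp = mnc[q]                        # longest coverless prefix of x[1..q]
            cpr[qp] += min(qp, i - pr[qp])     # newly covered positions only
            pr[qp] = i
            j = i + qp - 1
            if cpr[qp] >= cmec[j]:
                mec[j] = qp
            if cpr[qp] > cmec[j]:
                cmec[j] = cpr[qp]
            q = qp - 1
    return mec[1:], cmec[1:]


print(compute_mec(prefix_table_1based("ababaababa")))
# ([0, 0, 1, 2, 3, 1, 2, 3, 2, 3], [0, 0, 2, 4, 5, 4, 6, 8, 8, 10])
```

The string abaaaaba shows why ties matter: its borders aba and a both cover six positions, and the shorter one must be reported.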

3 Correctness and Complexity of Compute_MEC

We begin by proving the correctness of Compute_MEC, which depends
on the prior computation of π = π_x [6]. Consider first procedure
Compute_MNC, where B is computed, followed by the cover array γ[1..B]. Then
for every q ∈ 1..B, MNC[q] ← q whenever there is no cover of x[1..q], with
MNC[q] ← MNC[q−1] otherwise, a straightforward calculation.

Compute_MEC then independently considers positions i = 2, 3, ..., n
for which π[i] > 0; that is, such that a border of x[1..i+q−1] of length
q = π[i] begins at i. The internal while loop then processes in decreasing order of
length the prefixes Q' = x[1..q'] of x[1..q] that have no cover and that
therefore, by Observation 2, can possibly be minimum enhanced covers of
x[1..i+q'−1]. Thus, for every i ∈ 2..n, all such borders x[1..q] = x[i..i+q−1]
are considered and, for each one, all such prefixes Q'. For each q':

• the number CPR[q'] of positions covered by Q' is updated, as well as
the position PR[q'] = i of rightmost occurrence of Q';

• MEC[i+q'−1] and CMEC[i+q'−1] are updated accordingly for sufficiently
large CPR[q'].

We claim therefore that 

Theorem 4 For a given string x, Compute_MEC correctly computes the
minimum enhanced cover array MEC_x and the number CMEC_x of positions
covered by it, based solely on the prefix array π_x.

We have seen that in aggregate Compute_MEC processes a subset of
the nonempty borders of every prefix x[1..i], devoting O(1) time to each
one. As we have seen, each border Q' in each such subset is constrained
to have no cover. We say that a string v is strongly periodic if it has a
border u such that |u| ≥ |v|/2; otherwise v is said to be weakly periodic.
Observe that the borders Q' must all be weakly periodic; if not, then they
would have a cover u with |u| ≥ |v|/2. In [12] the following result is
proved:

Lemma 5 There are at most log_2 n weakly periodic borders of a string
of length n.

It follows then that for each i ∈ 2..n, there are at most log_2 i borders
considered, thus overall O(n log n) time.
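Lemma 5 is easy to test exhaustively on short strings. The Python sketch below assumes the reading of the definition in which v is weakly periodic when every border of v is strictly shorter than |v|/2; under that assumption each weakly periodic border is more than twice as long as the next one, which gives the logarithmic bound. The function names are illustrative.

```python
from itertools import product


def border_lengths(v):
    """Lengths of all nonempty proper borders of v."""
    return [k for k in range(1, len(v)) if v[:k] == v[-k:]]


def is_weakly_periodic(v):
    """True if v has no border u with |u| >= |v|/2."""
    return all(2 * k < len(v) for k in border_lengths(v))


def weakly_periodic_borders(x):
    """Lengths of the nonempty borders of x that are weakly periodic."""
    return [k for k in border_lengths(x) if is_weakly_periodic(x[:k])]


def max_weak_borders(n, alphabet="ab"):
    """Largest number of weakly periodic borders over all strings of length n."""
    return max(len(weakly_periodic_borders("".join(w)))
               for w in product(alphabet, repeat=n))


for n in (4, 8, 10):
    # the count k always satisfies 2**k <= n, i.e. k <= log2(n)
    print(n, max_weak_borders(n))
```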

The space requirement of Compute_MEC, apart from the π, MEC and
CMEC arrays, each of length n, consists of three integer arrays (MNC
(overwriting γ), PR, CPR), each of length B < n. Thus

Theorem 6 In the worst case, Compute_MEC computes MEC and CMEC
from π using

(a) O(n log n) time;

(b) three additional arrays 1..B of integers 1..n, thus Θ(B log n) bits of
space.

Now consider the expected (average) case behaviour of Compute_MEC.
This depends critically on the expected length of the maximum border of
x[1..n]; that is, the expected value of B. We show in the Appendix that for
a given alphabet size, B approaches a limit as n goes to infinity. The limit
is approximately 1.64 for binary alphabets, 0.69 for ternary alphabets,
and monotone decreasing in alphabet size. Thus

Theorem 7 In the average case, Compute_MEC requires O(n) time and
O(log n) additional bits of space.


4 Comparing Border-Based and Prefix-Based Algorithms 


As mentioned above, in order to compute MEC_x, the authors of
[12] made use of the border array, whereas Compute_MEC
is based on the prefix table. We have already highlighted the advantage
that the prefix table gives Compute_MEC over a border
array, especially in the context of indeterminate strings. Additionally,
the simplicity and low space usage of Compute_MEC encourage us to
compare its practical performance with that of the algorithm of [12].
This can be seen as a comparison between a border-based algorithm (i.e.,
the algorithm of [12]) for computing MEC_x and a prefix-based algorithm
(i.e., Compute_MEC of the current paper) for doing the same. In what
follows we refer to the former algorithm as ECB and the latter as
ECP.

We implemented ECP (i.e., Compute_MEC) in C# using Visual
Studio 2010. We obtained the implementation of ECB from the authors of [12].
However, ECB was implemented in C. To ensure a level playing field,
we re-implemented ECB in C# following their implementation. We then
ran both algorithms on all binary strings of lengths 2 to 30.
The experiments were carried out on a Windows Server 2008 R2
64-bit operating system, with an Intel(R) Core(TM) i7 2600 processor @
3.40GHz and 8.00 GB of installed memory (RAM). The results are
illustrated in Figures 3 and 4. Figure 3 reports the maximum number of
operations carried out by each algorithm; Figure 4 shows the
ratio of the total number of operations performed by the border-based
(ECB) [12] and prefix-based (ECP) algorithms to the length n of the string,
for all strings on the binary alphabet. As is evident from Figures 3 and
4, ECP outperforms ECB and in fact shows linear behaviour,
consistent with the claim in Theorem 7 above.
[Figure 3: plot titled "ECB vs. ECP"; horizontal axis: string length.]

Fig. 3. The maximum number of operations performed by the Border-Based (ECB)
[12] and Prefix-Based (ECP) algorithm (i.e., Compute_MEC) to compute the Minimum
Enhanced Cover array, for all strings on the binary alphabet.

Fig. 4. Ratio of the total number of operations performed by the Border-Based (ECB)
[12] and Prefix-Based (ECP) algorithm to the length n of string, for all strings on the
binary alphabet.

5 Extensions 

In Sections 2 and 3 we describe an algorithm to compute the minimum
enhanced cover array MEC_x of a given string x, based only on the prefix
array π_x. As noted in the Introduction, since the prefix array can be
computed also for indeterminate strings, this immediately raises the
possibility of extending the MEC calculation to indeterminate strings.

In [2] two definitions of “cover” for an indeterminate string are
proposed: a sliding cover where adjacent or overlapping covering substrings
of x must match, and a rooted cover where each covering substring is
constrained only to match a prefix of x. The nontransitivity of matching
(see Section 1) inhibits implementation of a sliding cover, but [2] shows
how to compute all the rooted covers of indeterminate x from its prefix
array in O(n^2) worst case time, O(n) in the average case. Thus it becomes
possible to execute Compute_MNC for rooted covers, simply by replacing
the function call to Compute_PCR by a function call to PCInd of [2];
that is, to compute the rooted cover array γ[1..B], hence MNC[1..B] and
thus MEC_x, all for indeterminate strings. Let us call this new algorithm
Compute_MEC_Ind. We recall now a lemma from [5] stating that the
expected number of borders in an indeterminate string is bounded above by
a constant, approximately 29. Therefore, also for indeterminate strings, B
can be treated as a constant, and we have the following remarkable result:


Theorem 8 In the average case, Compute_MEC_Ind requires O(n) time
and O(log n) additional bits of space.

We note further that the prefix array can be efficiently computed in
a compressed form [20], taking advantage of the fact that for i ∈ 1..n,
π[i] ≠ 0 if and only if x[i] = x[1]. Thus we can use two arrays POS and
LEN to store nonzero positions in π and the values at those positions,
respectively, thus saving much space in cases that arise in practice. We
have designed a POS/LEN version of Compute_MEC that space restrictions
do not allow us to describe here.
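The POS/LEN representation itself is simple to illustrate. The Python sketch below converts a full prefix table to the two arrays and back; it shows only the data layout, under the assumption that position 1 (where π[1] = n) is stored like any other nonzero entry, and it is not the compressed computation of [20] nor the POS/LEN version of Compute_MEC mentioned above.

```python
def compress_prefix_table(pi):
    """Split a prefix table (0-based list) into 1-based POS and matching LEN arrays."""
    pos = [i + 1 for i, value in enumerate(pi) if value != 0]
    length = [value for value in pi if value != 0]
    return pos, length


def decompress_prefix_table(pos, length, n):
    """Rebuild the full prefix table of length n from POS and LEN."""
    pi = [0] * n
    for p, value in zip(pos, length):
        pi[p - 1] = value
    return pi


pi = [10, 0, 3, 0, 1, 5, 0, 3, 0, 1]      # prefix table of ababaababa
pos, length = compress_prefix_table(pi)
print(pos)     # [1, 3, 5, 6, 8, 10]
print(length)  # [10, 3, 1, 5, 3, 1]
```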

Finally, [12] describes extensions of the minimum enhanced cover array
calculation, as follows:

• computation of the enhanced left-cover array of x;

• computation of the enhanced left-seed array of x.

Our prefix array approach yields efficient algorithms for these problems
also, that may similarly be extended to rooted covers of indeterminate
strings.
APPENDIX

We write $|x|$ for the length of string $x$. Here we show that the expected
length of the longest border of a string $x$ approaches a limit as $|x|$ tends
to infinity, the limit depending on the alphabet size. For a binary alphabet
it is approximately 1.64. We use the following notation: $\sigma = |\Sigma|$ is
the alphabet size, $B(w)$ is the length of the longest border of string $w$, and
$B_k(w)$ is the length of the longest border of $w$ that has length at most $k$
(i.e., ignoring any borders longer than $k$). Thus if $x = babaabababbabaabab$ then
$B(x) = 8$ since $x$ has longest border $babaabab$, and $B_4(x) = 3$ since the
longest border of $x$ that has length at most 4 is $bab$. $W_n$ is the set of all
strings of length $n$ on an alphabet of size $\sigma$. Since $W_0$ contains only the
empty string we have $|W_0| = 1$.


Lemma 9 The number of strings of length $n$ on an alphabet of size $\sigma$
which have a border of length $k$ (not necessarily the longest border) is
$\sigma^{n-k}$.

Proof. A string with border of length $k$ is periodic with period $n-k$ and
so is determined by its length $n-k$ prefix. This prefix can be chosen in
$\sigma^{n-k}$ ways. □
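Lemma 9 can be confirmed by exhaustive enumeration for small parameters. The following Python sketch counts the strings of length n whose prefix and suffix of length k coincide; the function name and the default binary alphabet are choices made here for illustration.

```python
from itertools import product


def count_with_border(n, k, alphabet="ab"):
    """Number of strings of length n over the alphabet having a border of length k."""
    return sum(1 for w in product(alphabet, repeat=n) if w[:k] == w[n - k:])


print(count_with_border(5, 2))         # 8  = 2**(5-2)
print(count_with_border(4, 3, "abc"))  # 3  = 3**(4-3)
```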

We also need the following formula (which can be obtained using a
computer algebra system).

Lemma 10
\[
\sum_{m=a}^{b} m\sigma^{m} = \frac{\sigma^{b+1}\bigl(\sigma(b+1)-\sigma-b-1\bigr) - \sigma^{a}\,(a\sigma-a-\sigma)}{(\sigma-1)^{2}}.
\]

Clearly $|W_n| = \sigma^n$. The expected size of the longest border of a string
of length $n$ on an alphabet of size $\sigma$ is therefore
\[
\bar{B}(n) = \frac{1}{\sigma^{n}} \sum_{w \in W_n} B(w). \tag{1}
\]
Similarly, the expected size of the longest border not exceeding $k$ is
\[
\bar{B}_k(n) = \frac{1}{\sigma^{n}} \sum_{w \in W_n} B_k(w). \tag{2}
\]
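Clearly $B(w) \ge B_k(w)$ so
\[
\bar{B}(n) \ge \bar{B}_k(n). \tag{3}
\]
Note that if $n \ge 2k$ then $W_n = \{uxv : u \in W_k,\ x \in W_{n-2k},\ v \in W_k\}$ and
so
\[
\bar{B}_k(n) = \frac{1}{\sigma^{n}} \sum_{u \in W_k} \sum_{x \in W_{n-2k}} \sum_{v \in W_k} B_k(uxv). \tag{4}
\]
Now $B_k(uxv) = B_k(uv)$ so if $n \ge 2k$,
\begin{align}
\bar{B}_k(n) &= \frac{1}{\sigma^{n}} \sum_{u \in W_k} \sum_{v \in W_k} \sum_{x \in W_{n-2k}} B_k(uv) \notag\\
&= \frac{\sigma^{n-2k}}{\sigma^{n}} \sum_{u \in W_k} \sum_{v \in W_k} B_k(uv) \notag\\
&= \frac{1}{\sigma^{2k}} \sum_{w \in W_{2k}} B_k(w) \notag\\
&= \bar{B}_k(2k). \tag{5}
\end{align}
With (3) we then have, for $n \ge 2k$,
\[
\bar{B}(n) \ge \bar{B}_k(2k). \tag{6}
\]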


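Now any border that is counted in the right hand side of (1) but not
counted on the right hand side of (2) has length at least $k+1$. The sum
of the lengths of such borders is, by Lemma 9,
\[
\sum_{m=k+1}^{n} m\sigma^{n-m}.
\]
So, by Lemma 10 and (5),
\begin{align}
\bar{B}(n) &\le \frac{1}{\sigma^{n}} \sum_{w \in W_n} B_k(w) + \frac{1}{\sigma^{n}} \sum_{m=k+1}^{n} m\sigma^{n-m} \notag\\
&= \bar{B}_k(n) + \frac{1}{\sigma^{n}} \cdot \frac{\sigma^{n-k+1}k + \sigma^{n-k+1} - \sigma^{n-k}k - \sigma n - \sigma + n}{(\sigma-1)^{2}} \notag\\
&< \bar{B}_k(n) + \frac{\sigma^{-k+1}k + \sigma^{-k+1} - \sigma^{-k}k}{(\sigma-1)^{2}} \notag\\
&= \bar{B}_k(2k) + O(k\sigma^{-k}). \tag{7}
\end{align}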
Thus for all $n \ge 2k$,
\[
\bar{B}_k(2k) \le \bar{B}(n) \le \bar{B}_k(2k) + O(k\sigma^{-k}),
\]
so the values $\bar{B}(n)$ are contained in an arbitrarily small interval. Call this
interval $I_k$ and define $J_1 = I_1$ and, for $i \ge 2$, $J_i = I_i \cap J_{i-1}$. Then
$J_1, J_2, \ldots$ is a sequence of nested intervals whose lengths have limit 0. By the
Nested Intervals Theorem this means that the limit of $\bar{B}(n)$ exists.

Using (3) and (7) with $k = 11$ we find that $\lim_{n\to\infty} \bar{B}(n)$ lies in the
interval (1.6356, 1.6420) for binary alphabets. For ternary alphabets, using
$k = 6$, the limit lies in (0.6811, 0.6864).
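The sandwich bound can be explored numerically for small k. The Python sketch below computes the averages in (1) and (2) by enumerating all strings, and evaluates the explicit error term from the third line of (7); it is exponential in n and so is only a sanity check under those definitions, with function names chosen here for illustration. For k = 11 and a binary alphabet the error term is 6.5/1024 ≈ 0.0063, which agrees with the width of the interval quoted above.

```python
from itertools import product


def longest_border(w, k=None):
    """Length of the longest border of w, optionally restricted to length <= k."""
    limit = len(w) - 1 if k is None else min(k, len(w) - 1)
    for m in range(limit, 0, -1):
        if w[:m] == w[-m:]:
            return m
    return 0


def mean_longest_border(n, sigma=2, k=None):
    """Average of longest_border over all sigma**n strings of length n."""
    total = sum(longest_border(w, k) for w in product(range(sigma), repeat=n))
    return total / sigma ** n


def tail_bound(k, sigma=2):
    """Explicit error term (k*s^(1-k) + s^(1-k) - k*s^(-k)) / (s-1)^2 from (7)."""
    s = float(sigma)
    return (k * s ** (1 - k) + s ** (1 - k) - k * s ** (-k)) / (s - 1) ** 2


k = 3
lower = mean_longest_border(2 * k, 2, k)      # the lower end B_k(2k)
upper = lower + tail_bound(k, 2)
for n in range(2 * k, 11):
    # each mean should fall between the two bounds
    print(n, lower, mean_longest_border(n), upper)
```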


